Memory optimized dists_add_symmetric #18
Merged
+6
−8
This PR fixes the old issues that report a 'CUDA out of memory' error when running the evaluation script.
I traced the problem to a single function, cosypose.lib3d.distances.dists_add_symmetric.
It allocates tensors of sizes NxNx3 and NxNx1, where N is the number of points.
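To give a feeling for how quickly these two tensors grow, here is a rough back-of-the-envelope sketch. It assumes float32 storage (4 bytes per element) and ignores any batch dimension and all other allocations; the helper name `pairwise_tensor_bytes` and the example values of N are made up for illustration and do not come from the repository.

```python
def pairwise_tensor_bytes(n_points, bytes_per_element=4):
    """Return the sizes in bytes of an NxNx3 and an NxNx1 tensor.

    Assumption: dense storage, float32 by default, no batch dimension.
    """
    diff_bytes = n_points * n_points * 3 * bytes_per_element  # NxNx3 tensor
    norm_bytes = n_points * n_points * 1 * bytes_per_element  # NxNx1 tensor
    return diff_bytes, norm_bytes


# Illustrative point counts only; the real N depends on the object models.
for n in (1_000, 10_000, 30_000):
    diff_b, norm_b = pairwise_tensor_bytes(n)
    print(f"N={n}: NxNx3 = {diff_b / 2**30:.2f} GiB, NxNx1 = {norm_b / 2**30:.2f} GiB")
```

Both sizes grow quadratically with N, so doubling the number of points quadruples the memory needed. Passing `bytes_per_element=2` models half-precision storage.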
The same result can be achieved with a small rewrite of the code.
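As an illustration of how a nearest-point computation can avoid holding the full NxNx3 tensor, here is a minimal PyTorch sketch that processes the points in chunks. It is one possible approach under stated assumptions, not necessarily the change made in this PR: the function names, the `chunk_size` parameter and the plain (N, 3) input shape are all hypothetical.

```python
import torch


def nearest_diffs_naive(pred_pts, gt_pts):
    """For each predicted point, return the offset to its nearest GT point.

    Materializes the full (N, N, 3) difference tensor.
    """
    diffs = gt_pts.unsqueeze(0) - pred_pts.unsqueeze(1)  # (N, N, 3)
    sq_dists = (diffs ** 2).sum(dim=-1)                  # (N, N)
    nearest = sq_dists.argmin(dim=1)                     # (N,)
    return diffs[torch.arange(pred_pts.shape[0]), nearest]  # (N, 3)


def nearest_diffs_chunked(pred_pts, gt_pts, chunk_size=1024):
    """Same result, but the largest temporary is (chunk_size, N, 3)."""
    out = []
    for start in range(0, pred_pts.shape[0], chunk_size):
        chunk = pred_pts[start:start + chunk_size]        # (C, 3)
        diffs = gt_pts.unsqueeze(0) - chunk.unsqueeze(1)  # (C, N, 3)
        nearest = (diffs ** 2).sum(dim=-1).argmin(dim=1)  # (C,)
        out.append(gt_pts[nearest] - chunk)               # (C, 3)
    return torch.cat(out, dim=0)


if __name__ == "__main__":
    torch.manual_seed(0)
    pred = torch.rand(500, 3)
    gt = torch.rand(500, 3)
    same = torch.allclose(nearest_diffs_naive(pred, gt),
                          nearest_diffs_chunked(pred, gt, chunk_size=128))
    print("chunked matches naive:", same)
```

The chunk size trades speed for peak memory: smaller chunks lower the largest temporary allocation at the cost of more loop iterations.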
Alternative solutions are also possible and have been tested to work:
float16
Those approaches were tested but are not used in the current proposal. They could be added later if the issue arises again with larger point clouds or when the code has to run on constrained hardware.
Some distance functions in the lib3d.symmetric_distances.py file could also be optimized, as they perform similar distance computations.
This solution uses less than 0.25 of the original version's memory.
An experiment was performed on the run_cosy_pose_eval.py pipeline, evaluating 30 objects from the tless.bop version of the dataset.
The experiment with the RTX-2080 (8 GB) was not clean because the GPU was also used for the system GUI runtime.
The scenario with the TITAN-X (12 GB) was much cleaner, as it was performed on a headless server.
The old version of the code fails on both setups, while the new one works on both.
The fact that the failure appears at such a low threshold could be explained by memory used for context data and by fragmentation. The error is triggered by the requirement to allocate one very large contiguous tensor. This PR fixes the